Overview

Dataset statistics

Number of variables: 11
Number of observations: 972
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 83.7 KiB
Average record size in memory: 88.1 B

Variable types

Numeric: 7
Categorical: 4

Warnings

molecule_chembl_id has a high cardinality: 525 distinct values (High cardinality)
canonical_smiles has a high cardinality: 525 distinct values (High cardinality)
standard_value is highly correlated with standard_value_norm (High correlation)
standard_value_norm is highly correlated with standard_value (High correlation)
bioactivity_class is highly correlated with 0 (High correlation)
0 is highly correlated with bioactivity_class (High correlation)
Unnamed: 0 is uniformly distributed (Uniform)
Unnamed: 0 has unique values (Unique)
NumHDonors has 153 (15.7%) zeros (Zeros)

Reproduction

Analysis started: 2021-01-30 17:45:59.512289
Analysis finished: 2021-01-30 17:46:12.108061
Duration: 12.6 seconds
Software version: pandas-profiling v2.10.0
Download configuration: config.yaml

Variables

Unnamed: 0
Real number (ℝ≥0)

UNIFORM
UNIQUE

Distinct: 972
Distinct (%): 100.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 485.5
Minimum: 0
Maximum: 971
Zeros: 1
Zeros (%): 0.1%
Memory size: 7.7 KiB

Quantile statistics

Minimum: 0
5-th percentile: 48.55
Q1: 242.75
median: 485.5
Q3: 728.25
95-th percentile: 922.45
Maximum: 971
Range: 971
Interquartile range (IQR): 485.5

Descriptive statistics

Standard deviation: 280.7365313
Coefficient of variation (CV): 0.578242083
Kurtosis: -1.2
Mean: 485.5
Median Absolute Deviation (MAD): 243
Skewness: 0
Sum: 471906
Variance: 78813
Monotonicity: Strictly increasing

Histogram with fixed size bins (bins=50)

Common values

Value | Count | Frequency (%)
971 | 1 | 0.1%
303 | 1 | 0.1%
331 | 1 | 0.1%
330 | 1 | 0.1%
329 | 1 | 0.1%
328 | 1 | 0.1%
327 | 1 | 0.1%
326 | 1 | 0.1%
325 | 1 | 0.1%
324 | 1 | 0.1%
Other values (962) | 962 | 99.0%

Minimum 5 values

Value | Count | Frequency (%)
0 | 1 | 0.1%
1 | 1 | 0.1%
2 | 1 | 0.1%
3 | 1 | 0.1%
4 | 1 | 0.1%

Maximum 5 values

Value | Count | Frequency (%)
971 | 1 | 0.1%
970 | 1 | 0.1%
969 | 1 | 0.1%
968 | 1 | 0.1%
967 | 1 | 0.1%
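
As a rough cross-check, most of the figures reported for this variable can be reproduced with Python's standard library, under the assumption (not stated in the report) that `Unnamed: 0` simply holds the integers 0 through 971. That assumption is consistent with the reported minimum, maximum, distinct count and sum. This is a sketch only, not the profiler's own computation.

```python
import math
import statistics

# Assumption: the column is the row counter 0, 1, ..., 971.
values = list(range(972))

total = sum(values)
mean = statistics.mean(values)
variance = statistics.variance(values)  # sample variance (n - 1 denominator)
std_dev = math.sqrt(variance)
cv = std_dev / mean                     # coefficient of variation

# Quartiles with linear interpolation between closest ranks.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Median absolute deviation around the median.
mad = statistics.median(abs(v - median) for v in values)

print(total, mean, variance, round(std_dev, 7), round(cv, 9))
print(q1, median, q3, iqr, mad)
```

Under that assumption the computed sum, mean, variance, quartiles, IQR and MAD agree with the values listed above.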

molecule_chembl_id
Categorical

HIGH CARDINALITY

Distinct: 525
Distinct (%): 54.0%
Missing: 0
Missing (%): 0.0%
Memory size: 7.7 KiB

Value | Count
CHEMBL267678 | 11
CHEMBL69638 | 7
CHEMBL69960 | 7
CHEMBL68920 | 7
CHEMBL320290 | 7
Other values (520) | 933

Length

Max length: 13
Median length: 12
Mean length: 11.74074074
Min length: 8

Characters and Unicode

Total characters: 11412
Distinct characters: 16
Distinct categories: 2
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
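
A minimal sketch of that idea using Python's `unicodedata` module, applied to the five sample values listed for this variable. The helper name `category_counts` is illustrative. The standard library exposes general categories (for example `Lu` for uppercase letters and `Nd` for decimal numbers) but not scripts or blocks, so only categories are counted here.

```python
import unicodedata
from collections import Counter


def category_counts(values):
    """Count characters per Unicode general category across all strings."""
    counts = Counter()
    for value in values:
        for ch in value:
            counts[unicodedata.category(ch)] += 1
    return counts


# The five sample rows shown for molecule_chembl_id.
sample_ids = [
    "CHEMBL324340",
    "CHEMBL324340",
    "CHEMBL109600",
    "CHEMBL357278",
    "CHEMBL357119",
]

counts = category_counts(sample_ids)
print(dict(counts))
```

Each of these identifiers consists of the six uppercase letters of the `CHEMBL` prefix followed by six digits, so the two categories come out evenly split for this small sample.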

Unique

Unique: 304
Unique (%): 31.3%

Sample

1st row: CHEMBL324340
2nd row: CHEMBL324340
3rd row: CHEMBL109600
4th row: CHEMBL357278
5th row: CHEMBL357119

Value | Count | Frequency (%)
CHEMBL267678 | 11 | 1.1%
CHEMBL69638 | 7 | 0.7%
CHEMBL69960 | 7 | 0.7%
CHEMBL68920 | 7 | 0.7%
CHEMBL320290 | 7 | 0.7%
CHEMBL321164 | 7 | 0.7%
CHEMBL60707 | 6 | 0.6%
CHEMBL133897 | 6 | 0.6%
CHEMBL543416 | 6 | 0.6%
CHEMBL357278 | 5 | 0.5%
Other values (515) | 903 | 92.9%
Histogram of lengths of the category
ValueCountFrequency (%)
chembl26767811
 
1.1%
chembl696387
 
0.7%
chembl689207
 
0.7%
chembl699607
 
0.7%
chembl3211647
 
0.7%
chembl3202907
 
0.7%
chembl607076
 
0.6%
chembl5434166
 
0.6%
chembl1338976
 
0.6%
chembl4232575
 
0.5%
Other values (515)903
92.9%

Most occurring characters

ValueCountFrequency (%)
C972
 
8.5%
H972
 
8.5%
E972
 
8.5%
M972
 
8.5%
B972
 
8.5%
L972
 
8.5%
1851
 
7.5%
3707
 
6.2%
2684
 
6.0%
4574
 
5.0%
Other values (6)2764
24.2%

Most occurring categories

ValueCountFrequency (%)
Uppercase Letter5832
51.1%
Decimal Number5580
48.9%

Most frequent character per category

ValueCountFrequency (%)
1851
15.3%
3707
12.7%
2684
12.3%
4574
10.3%
5497
8.9%
6492
8.8%
7480
8.6%
0459
8.2%
8419
7.5%
9417
7.5%
ValueCountFrequency (%)
C972
16.7%
H972
16.7%
E972
16.7%
M972
16.7%
B972
16.7%
L972
16.7%

Most occurring scripts

ValueCountFrequency (%)
Latin5832
51.1%
Common5580
48.9%

Most frequent character per script

ValueCountFrequency (%)
1851
15.3%
3707
12.7%
2684
12.3%
4574
10.3%
5497
8.9%
6492
8.8%
7480
8.6%
0459
8.2%
8419
7.5%
9417
7.5%
ValueCountFrequency (%)
C972
16.7%
H972
16.7%
E972
16.7%
M972
16.7%
B972
16.7%
L972
16.7%

Most occurring blocks

ValueCountFrequency (%)
ASCII11412
100.0%

Most frequent character per block

ValueCountFrequency (%)
C972
 
8.5%
H972
 
8.5%
E972
 
8.5%
M972
 
8.5%
B972
 
8.5%
L972
 
8.5%
1851
 
7.5%
3707
 
6.2%
2684
 
6.0%
4574
 
5.0%
Other values (6)2764
24.2%

canonical_smiles
Categorical

HIGH CARDINALITY

Distinct: 525
Distinct (%): 54.0%
Missing: 0
Missing (%): 0.0%
Memory size: 7.7 KiB

Value | Count
Cc1ccc(Sc2cncc3sc(C(N)=O)cc23)cc1 | 11
Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)c32)[nH]1 | 7
Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(Nc3ccc(F)c(Cl)c3)c21 | 7
NC(Cc1ccnn1O)C(=O)O | 7
NC(Cc1cnnn1O)C(=O)O | 7
Other values (520) | 933

Length

Max length: 322
Median length: 49
Mean length: 59.24176955
Min length: 13

Characters and Unicode

Total characters: 57583
Distinct characters: 37
Distinct categories: 8
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 304
Unique (%): 31.3%

Sample

1st row: Cc1ccc2oc(-c3cccc(N4C(=O)c5ccc(C(=O)O)cc5C4=O)c3)nc2c1
2nd row: Cc1ccc2oc(-c3cccc(N4C(=O)c5ccc(C(=O)O)cc5C4=O)c3)nc2c1
3rd row: COc1ccccc1-c1ccc2oc(-c3ccc(OC)c(N4C(=O)c5ccc(C(=O)O)cc5C4=O)c3)nc2c1
4th row: Cc1nc2cc(OC[C@H](O)CN3CCN(CC(=O)Nc4ccc(Cl)c(C(F)(F)F)c4)CC3)ccc2s1
5th row: Cc1nc2cc(OC[C@H](O)CN3CCN(CC(=O)NCCc4ccccc4)CC3)ccc2s1

Value | Count | Frequency (%)
Cc1ccc(Sc2cncc3sc(C(N)=O)cc23)cc1 | 11 | 1.1%
Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)c32)[nH]1 | 7 | 0.7%
Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(Nc3ccc(F)c(Cl)c3)c21 | 7 | 0.7%
NC(Cc1ccnn1O)C(=O)O | 7 | 0.7%
NC(Cc1cnnn1O)C(=O)O | 7 | 0.7%
Nc1ncnc2c1c(-c1cccc(Oc3ccccc3)c1)cn2C1CCCC1 | 7 | 0.7%
CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1 | 6 | 0.6%
O=C1C=C(c2c[nH]c3ccccc23)c2ccccc2C1=O | 6 | 0.6%
CC(C)[C@@H]1C(=O)N(S(C)(=O)=O)[C@@H]2CCN(C(=O)c3cnc(CNC4CC4)cn3)[C@@H]12.Cl | 6 | 0.6%
Cn1cc(NC(=O)c2cc(NC(=O)c3cc(NC=O)cn3C)cn2C)cc1C(=O)NCCC(=N)N | 5 | 0.5%
Other values (515) | 903 | 92.9%
Histogram of lengths of the category
ValueCountFrequency (%)
cc1ccc(sc2cncc3sc(c(n)=o)cc23)cc111
 
1.1%
cc1cc(c)c(/c=c2\c(=o)nc3ncnc(nc4ccc(f)c(cl)c4)c32)[nh]17
 
0.7%
nc1ncnc2c1c(-c1cccc(oc3ccccc3)c1)cn2c1cccc17
 
0.7%
cc1cc(c(=o)n2ccocc2)[nh]c1/c=c1\c(=o)nc2ncnc(nc3ccc(f)c(cl)c3)c217
 
0.7%
nc(cc1ccnn1o)c(=o)o7
 
0.7%
nc(cc1cnnn1o)c(=o)o7
 
0.7%
coc1cc(ccc2cnc3nc(n)nc(n)c3c2)cc(oc)c1oc6
 
0.6%
o=c1c=c(c2c[nh]c3ccccc23)c2ccccc2c1=o6
 
0.6%
ccoc1nn(-c2cccc(occ3ccccc3)c2)c(=o)o16
 
0.6%
cc(c)[c@@h]1c(=o)n(s(c)(=o)=o)[c@@h]2ccn(c(=o)c3cnc(cnc4cc4)cn3)[c@@h]12.cl6
 
0.6%
Other values (512)902
92.8%

Most occurring characters

ValueCountFrequency (%)
c12786
22.2%
C9952
17.3%
(5495
9.5%
)5495
9.5%
O3917
 
6.8%
12862
 
5.0%
=2411
 
4.2%
N2284
 
4.0%
22274
 
3.9%
31396
 
2.4%
Other values (27)8711
15.1%

Most occurring categories

ValueCountFrequency (%)
Uppercase Letter17738
30.8%
Lowercase Letter14517
25.2%
Decimal Number7232
12.6%
Open Punctuation6711
 
11.7%
Close Punctuation6711
 
11.7%
Math Symbol2460
 
4.3%
Other Punctuation1874
 
3.3%
Dash Punctuation340
 
0.6%

Most frequent character per category

ValueCountFrequency (%)
C9952
56.1%
O3917
 
22.1%
N2284
 
12.9%
H1076
 
6.1%
S238
 
1.3%
F212
 
1.2%
B30
 
0.2%
P20
 
0.1%
I8
 
< 0.1%
K1
 
< 0.1%
ValueCountFrequency (%)
12862
39.6%
22274
31.4%
31396
19.3%
4490
 
6.8%
5134
 
1.9%
648
 
0.7%
724
 
0.3%
84
 
0.1%
ValueCountFrequency (%)
c12786
88.1%
n1215
 
8.4%
l275
 
1.9%
s134
 
0.9%
o58
 
0.4%
r30
 
0.2%
a19
 
0.1%
ValueCountFrequency (%)
@1380
73.6%
/264
 
14.1%
.101
 
5.4%
\74
 
3.9%
#55
 
2.9%
ValueCountFrequency (%)
(5495
81.9%
[1216
 
18.1%
ValueCountFrequency (%)
=2411
98.0%
+49
 
2.0%
ValueCountFrequency (%)
)5495
81.9%
]1216
 
18.1%
ValueCountFrequency (%)
-340
100.0%

Most occurring scripts

ValueCountFrequency (%)
Latin32255
56.0%
Common25328
44.0%

Most frequent character per script

ValueCountFrequency (%)
(5495
21.7%
)5495
21.7%
12862
11.3%
=2411
9.5%
22274
9.0%
31396
 
5.5%
@1380
 
5.4%
[1216
 
4.8%
]1216
 
4.8%
4490
 
1.9%
Other values (10)1093
 
4.3%
ValueCountFrequency (%)
c12786
39.6%
C9952
30.9%
O3917
 
12.1%
N2284
 
7.1%
n1215
 
3.8%
H1076
 
3.3%
l275
 
0.9%
S238
 
0.7%
F212
 
0.7%
s134
 
0.4%
Other values (7)166
 
0.5%

Most occurring blocks

ValueCountFrequency (%)
ASCII57583
100.0%

Most frequent character per block

ValueCountFrequency (%)
c12786
22.2%
C9952
17.3%
(5495
9.5%
)5495
9.5%
O3917
 
6.8%
12862
 
5.0%
=2411
 
4.2%
N2284
 
4.0%
22274
 
3.9%
31396
 
2.4%
Other values (27)8711
15.1%

standard_value
Real number (ℝ≥0)

HIGH CORRELATION

Distinct: 536
Distinct (%): 55.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 2667621.103
Minimum: 3.64 × 10⁻⁵
Maximum: 1000000000
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.7 KiB

Quantile statistics

Minimum: 3.64 × 10⁻⁵
5-th percentile: 0.8365
Q1: 30
median: 681.5
Q3: 10000
95-th percentile: 219850
Maximum: 1000000000
Range: 1000000000
Interquartile range (IQR): 9970

Descriptive statistics

Standard deviation: 47313769.65
Coefficient of variation (CV): 17.73631555
Kurtosis: 344.3504494
Mean: 2667621.103
Median Absolute Deviation (MAD): 678.6
Skewness: 18.42635699
Sum: 2592927712
Variance: 2.238592798 × 10¹⁵
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=50)

Common values

Value | Count | Frequency (%)
100000 | 35 | 3.6%
10000 | 32 | 3.3%
1000 | 30 | 3.1%
100 | 20 | 2.1%
20000 | 15 | 1.5%
13 | 10 | 1.0%
6 | 10 | 1.0%
20 | 9 | 0.9%
4 | 8 | 0.8%
23 | 7 | 0.7%
Other values (526) | 796 | 81.9%

Minimum 5 values

Value | Count | Frequency (%)
3.64 × 10⁻⁵ | 1 | 0.1%
0.00066 | 1 | 0.1%
0.00086 | 1 | 0.1%
0.003 | 1 | 0.1%
0.0043 | 1 | 0.1%

Maximum 5 values

Value | Count | Frequency (%)
1000000000 | 1 | 0.1%
794328234.7 | 1 | 0.1%
741310241.3 | 1 | 0.1%
5000000 | 1 | 0.1%
3388441.56 | 1 | 0.1%

0
Categorical

HIGH CORRELATION

Distinct: 3
Distinct (%): 0.3%
Missing: 0
Missing (%): 0.0%
Memory size: 7.7 KiB

Value | Count
active | 519
inactive | 288
intermediate | 165

Length

Max length: 12
Median length: 6
Mean length: 7.611111111
Min length: 6

Characters and Unicode

Total characters: 7398
Distinct characters: 10
Distinct categories: 1
Distinct scripts: 1
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: intermediate
2nd row: inactive
3rd row: intermediate
4th row: intermediate
5th row: inactive

Value | Count | Frequency (%)
active | 519 | 53.4%
inactive | 288 | 29.6%
intermediate | 165 | 17.0%
Histogram of lengths of the category
ValueCountFrequency (%)
active519
53.4%
inactive288
29.6%
intermediate165
 
17.0%

Most occurring characters

ValueCountFrequency (%)
i1425
19.3%
e1302
17.6%
t1137
15.4%
a972
13.1%
c807
10.9%
v807
10.9%
n453
 
6.1%
r165
 
2.2%
m165
 
2.2%
d165
 
2.2%

Most occurring categories

ValueCountFrequency (%)
Lowercase Letter7398
100.0%

Most frequent character per category

ValueCountFrequency (%)
i1425
19.3%
e1302
17.6%
t1137
15.4%
a972
13.1%
c807
10.9%
v807
10.9%
n453
 
6.1%
r165
 
2.2%
m165
 
2.2%
d165
 
2.2%

Most occurring scripts

ValueCountFrequency (%)
Latin7398
100.0%

Most frequent character per script

ValueCountFrequency (%)
i1425
19.3%
e1302
17.6%
t1137
15.4%
a972
13.1%
c807
10.9%
v807
10.9%
n453
 
6.1%
r165
 
2.2%
m165
 
2.2%
d165
 
2.2%

Most occurring blocks

ValueCountFrequency (%)
ASCII7398
100.0%

Most frequent character per block

ValueCountFrequency (%)
i1425
19.3%
e1302
17.6%
t1137
15.4%
a972
13.1%
c807
10.9%
v807
10.9%
n453
 
6.1%
r165
 
2.2%
m165
 
2.2%
d165
 
2.2%

MW
Real number (ℝ≥0)

Distinct: 507
Distinct (%): 52.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 449.9056759
Minimum: 122.171
Maximum: 2040.418
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.7 KiB

Quantile statistics

Minimum: 122.171
5-th percentile: 214.22265
Q1: 323.41475
median: 389.495
Q3: 484.578
95-th percentile: 850.035
Maximum: 2040.418
Range: 1918.247
Interquartile range (IQR): 161.16325

Descriptive statistics

Standard deviation: 249.0827076
Coefficient of variation (CV): 0.5536331746
Kurtosis: 13.65725735
Mean: 449.9056759
Median Absolute Deviation (MAD): 85.7605
Skewness: 3.173680051
Sum: 437308.317
Variance: 62042.19525
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=50)

Common values

Value | Count | Frequency (%)
300.408 | 11 | 1.1%
482.903 | 7 | 0.7%
370.456 | 7 | 0.7%
383.814 | 7 | 0.7%
172.144 | 7 | 0.7%
171.156 | 7 | 0.7%
312.325 | 6 | 0.6%
448.501 | 6 | 0.6%
273.291 | 6 | 0.6%
378.516 | 6 | 0.6%
Other values (497) | 902 | 92.8%

Minimum 5 values

Value | Count | Frequency (%)
122.171 | 3 | 0.3%
126.111 | 1 | 0.1%
134.61 | 3 | 0.3%
137.138 | 1 | 0.1%
162.144 | 2 | 0.2%

Maximum 5 values

Value | Count | Frequency (%)
2040.418 | 2 | 0.2%
1967.039 | 3 | 0.3%
1880.898 | 3 | 0.3%
1632.484 | 3 | 0.3%
1550.243 | 3 | 0.3%

LogP
Real number (ℝ)

Distinct: 516
Distinct (%): 53.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.075184074
Minimum: -10.0734
Maximum: 12.9701
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.7 KiB

Quantile statistics

Minimum: -10.0734
5-th percentile: -0.80272
Q1: 1.992675
median: 3.2381
Q3: 4.476915
95-th percentile: 6.233831
Maximum: 12.9701
Range: 23.0435
Interquartile range (IQR): 2.48424

Descriptive statistics

Standard deviation: 2.373646032
Coefficient of variation (CV): 0.7718712036
Kurtosis: 4.789317237
Mean: 3.075184074
Median Absolute Deviation (MAD): 1.2415
Skewness: -0.9089335509
Sum: 2989.07892
Variance: 5.634195487
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=50)

Common values

Value | Count | Frequency (%)
3.85482 | 11 | 1.1%
4.45034 | 7 | 0.7%
3.61432 | 7 | 0.7%
-1.5302 | 7 | 0.7%
-0.9252 | 7 | 0.7%
5.5879 | 7 | 0.7%
3.365 | 6 | 0.6%
0.8075 | 6 | 0.6%
2.8032 | 6 | 0.6%
4.3625 | 6 | 0.6%
Other values (506) | 902 | 92.8%

Minimum 5 values

Value | Count | Frequency (%)
-10.0734 | 2 | 0.2%
-9.77562 | 3 | 0.3%
-4.9118 | 1 | 0.1%
-4.1872 | 1 | 0.1%
-3.6059 | 3 | 0.3%

Maximum 5 values

Value | Count | Frequency (%)
12.9701 | 2 | 0.2%
10.9455 | 1 | 0.1%
9.2028 | 3 | 0.3%
8.9024 | 3 | 0.3%
8.8871 | 1 | 0.1%

NumHDonors
Real number (ℝ≥0)

ZEROS

Distinct: 19
Distinct (%): 2.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 2.577160494
Minimum: 0
Maximum: 22
Zeros: 153
Zeros (%): 15.7%
Memory size: 7.7 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 1
median: 2
Q3: 3
95-th percentile: 8
Maximum: 22
Range: 22
Interquartile range (IQR): 2

Descriptive statistics

Standard deviation: 3.259654592
Coefficient of variation (CV): 1.264824057
Kurtosis: 10.95514316
Mean: 2.577160494
Median Absolute Deviation (MAD): 1
Skewness: 3.010250001
Sum: 2505
Variance: 10.62534806
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=19)

Common values

Value | Count | Frequency (%)
1 | 307 | 31.6%
2 | 198 | 20.4%
0 | 153 | 15.7%
3 | 132 | 13.6%
4 | 47 | 4.8%
6 | 30 | 3.1%
7 | 29 | 3.0%
5 | 24 | 2.5%
12 | 12 | 1.2%
10 | 6 | 0.6%
Other values (9) | 34 | 3.5%

Minimum 5 values

Value | Count | Frequency (%)
0 | 153 | 15.7%
1 | 307 | 31.6%
2 | 198 | 20.4%
3 | 132 | 13.6%
4 | 47 | 4.8%

Maximum 5 values

Value | Count | Frequency (%)
22 | 2 | 0.2%
21 | 3 | 0.3%
18 | 6 | 0.6%
17 | 3 | 0.3%
16 | 3 | 0.3%

NumHAcceptors
Real number (ℝ≥0)

Distinct: 22
Distinct (%): 2.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 6.055555556
Minimum: 0
Maximum: 28
Zeros: 4
Zeros (%): 0.4%
Memory size: 7.7 KiB

Quantile statistics

Minimum: 0
5-th percentile: 2
Q1: 4
median: 5
Q3: 7
95-th percentile: 13
Maximum: 28
Range: 28
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 3.892610973
Coefficient of variation (CV): 0.6428164909
Kurtosis: 11.41043947
Mean: 6.055555556
Median Absolute Deviation (MAD): 2
Skewness: 2.809169415
Sum: 5886
Variance: 15.15242019
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=22)

Common values

Value | Count | Frequency (%)
5 | 161 | 16.6%
4 | 158 | 16.3%
6 | 126 | 13.0%
3 | 123 | 12.7%
7 | 120 | 12.3%
8 | 68 | 7.0%
2 | 65 | 6.7%
9 | 37 | 3.8%
10 | 37 | 3.8%
14 | 13 | 1.3%
Other values (12) | 64 | 6.6%

Minimum 5 values

Value | Count | Frequency (%)
0 | 4 | 0.4%
1 | 9 | 0.9%
2 | 65 | 6.7%
3 | 123 | 12.7%
4 | 158 | 16.3%

Maximum 5 values

Value | Count | Frequency (%)
28 | 5 | 0.5%
27 | 6 | 0.6%
25 | 3 | 0.3%
20 | 3 | 0.3%
19 | 3 | 0.3%

standard_value_norm
Real number (ℝ≥0)

HIGH CORRELATION

Distinct: 534
Distinct (%): 54.9%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 367581.5189
Minimum: 3.64 × 10⁻⁵
Maximum: 100000000
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.7 KiB

Quantile statistics

Minimum: 3.64 × 10⁻⁵
5-th percentile: 0.8365
Q1: 30
median: 681.5
Q3: 10000
95-th percentile: 219850
Maximum: 100000000
Range: 100000000
Interquartile range (IQR): 9970

Descriptive statistics

Standard deviation: 5554801.844
Coefficient of variation (CV): 15.11175496
Kurtosis: 318.7381441
Mean: 367581.5189
Median Absolute Deviation (MAD): 678.6
Skewness: 17.86551864
Sum: 357289236.4
Variance: 3.085582352 × 10¹³
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=50)

Common values

Value | Count | Frequency (%)
100000 | 35 | 3.6%
10000 | 32 | 3.3%
1000 | 30 | 3.1%
100 | 20 | 2.1%
20000 | 15 | 1.5%
6 | 10 | 1.0%
13 | 10 | 1.0%
20 | 9 | 0.9%
4 | 8 | 0.8%
23 | 7 | 0.7%
Other values (524) | 796 | 81.9%

Minimum 5 values

Value | Count | Frequency (%)
3.64 × 10⁻⁵ | 1 | 0.1%
0.00066 | 1 | 0.1%
0.00086 | 1 | 0.1%
0.003 | 1 | 0.1%
0.0043 | 1 | 0.1%

Maximum 5 values

Value | Count | Frequency (%)
100000000 | 3 | 0.3%
5000000 | 1 | 0.1%
3388441.56 | 1 | 0.1%
3200000 | 1 | 0.1%
3000000 | 1 | 0.1%

bioactivity_class
Categorical

HIGH CORRELATION

Distinct: 3
Distinct (%): 0.3%
Missing: 0
Missing (%): 0.0%
Memory size: 7.7 KiB

Value | Count
active | 519
inactive | 288
intermediate | 165

Length

Max length: 12
Median length: 6
Mean length: 7.611111111
Min length: 6

Characters and Unicode

Total characters: 7398
Distinct characters: 10
Distinct categories: 1
Distinct scripts: 1
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: intermediate
2nd row: inactive
3rd row: intermediate
4th row: intermediate
5th row: inactive

Value | Count | Frequency (%)
active | 519 | 53.4%
inactive | 288 | 29.6%
intermediate | 165 | 17.0%
Histogram of lengths of the category
ValueCountFrequency (%)
active519
53.4%
inactive288
29.6%
intermediate165
 
17.0%

Most occurring characters

ValueCountFrequency (%)
i1425
19.3%
e1302
17.6%
t1137
15.4%
a972
13.1%
c807
10.9%
v807
10.9%
n453
 
6.1%
r165
 
2.2%
m165
 
2.2%
d165
 
2.2%

Most occurring categories

ValueCountFrequency (%)
Lowercase Letter7398
100.0%

Most frequent character per category

ValueCountFrequency (%)
i1425
19.3%
e1302
17.6%
t1137
15.4%
a972
13.1%
c807
10.9%
v807
10.9%
n453
 
6.1%
r165
 
2.2%
m165
 
2.2%
d165
 
2.2%

Most occurring scripts

ValueCountFrequency (%)
Latin7398
100.0%

Most frequent character per script

ValueCountFrequency (%)
i1425
19.3%
e1302
17.6%
t1137
15.4%
a972
13.1%
c807
10.9%
v807
10.9%
n453
 
6.1%
r165
 
2.2%
m165
 
2.2%
d165
 
2.2%

Most occurring blocks

ValueCountFrequency (%)
ASCII7398
100.0%

Most frequent character per block

ValueCountFrequency (%)
i1425
19.3%
e1302
17.6%
t1137
15.4%
a972
13.1%
c807
10.9%
v807
10.9%
n453
 
6.1%
r165
 
2.2%
m165
 
2.2%
d165
 
2.2%

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
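
The calculation described above can be sketched in plain Python. The function name `pearson_r` and the toy data are illustrative assumptions. They are not values from this dataset, nor the profiler's implementation.

```python
import math


def pearson_r(xs, ys):
    """Covariance of X and Y divided by the product of their standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n (or 1/(n-1)) factors cancel between numerator and denominator,
    # so plain sums of products and squares are enough.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return cov / math.sqrt(ss_x * ss_y)


# Toy data (hypothetical), roughly linear.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]

r = pearson_r(xs, ys)
# Shifting and rescaling x leaves r unchanged (location/scale invariance).
r_rescaled = pearson_r([10 * x + 3 for x in xs], ys)
print(r, r_rescaled)
```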

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at capturing nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
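
A minimal sketch of that recipe: convert both variables to ranks (ties share the average rank, which is one common convention and an assumption here), then apply the covariance-over-standard-deviations formula to the ranks. The helper names and the toy data are illustrative.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Pearson correlation computed on the rank variables of X and Y."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy data (hypothetical): y = x**3 is nonlinear but strictly increasing.
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]
rho = spearman_rho(xs, ys)
print(rho)
```

Because only the ordering matters, the cubic relationship in the toy data yields a ρ of 1 even though the relationship is not linear.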

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the difference between the number of concordant pairs and the number of discordant pairs, divided by the total number of pairs.
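
The pair-counting definition above can be sketched directly. This is the simple form without tie adjustment (often called tau-a); library implementations may use tie-corrected variants, so results can differ when ties are present. The function name and toy data are illustrative.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """(concordant - discordant) / total number of pairs, ignoring tie corrections."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # The sign of the product tells whether the pair is ordered
        # the same way in both variables.
        direction = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# Toy data (hypothetical): two of the three pairs are concordant, one is discordant.
tau = kendall_tau([1, 2, 3], [1, 3, 2])
print(tau)
```

The double loop visits every pair, so this sketch is quadratic in the number of observations. That is fine for illustration but slow for large columns.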

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. Extensive documentation is available for it.

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. We use the bias-corrected measure proposed by Bergsma in 2013.
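
A sketch of a bias-corrected Cramér's V, using the correction as it is commonly stated for Bergsma's estimator (the exact formula used by the profiler is an assumption here). The function name is illustrative. The class counts 519, 288 and 165 come from this report. Treating the columns `0` and `bioactivity_class` as identical row by row is an assumption suggested by the sample rows.

```python
import math
from collections import Counter


def cramers_v_corrected(xs, ys):
    """Bias-corrected Cramér's V for two equally long sequences of labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)

    # Pearson chi-squared statistic of the contingency table.
    chi2 = 0.0
    for row, row_total in row_totals.items():
        for col, col_total in col_totals.items():
            expected = row_total * col_total / n
            observed = joint.get((row, col), 0)
            chi2 += (observed - expected) ** 2 / expected

    n_rows, n_cols = len(row_totals), len(col_totals)
    phi2 = chi2 / n
    # Bias correction applied to phi^2 and to the table dimensions.
    phi2_corr = max(0.0, phi2 - (n_cols - 1) * (n_rows - 1) / (n - 1))
    rows_corr = n_rows - (n_rows - 1) ** 2 / (n - 1)
    cols_corr = n_cols - (n_cols - 1) ** 2 / (n - 1)
    denom = min(rows_corr - 1, cols_corr - 1)
    if denom <= 0:
        return 0.0
    return math.sqrt(phi2_corr / denom)


labels = ["active"] * 519 + ["inactive"] * 288 + ["intermediate"] * 165
v_identical = cramers_v_corrected(labels, labels)
v_independent = cramers_v_corrected(["a", "a", "b", "b"], ["x", "y", "x", "y"])
print(v_identical, v_independent)
```

Two identical label columns give perfect association, which would be consistent with the high-correlation warning raised for `0` and `bioactivity_class`.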

Missing values

A simple visualization of nullity by column.
The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.
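
The per-column nullity behind such a chart can be tallied with a few lines of Python. The rows below are hypothetical and deliberately contain gaps for illustration, since this dataset reports no missing cells. The function name `missing_counts` is likewise illustrative.

```python
import math


def missing_counts(rows, columns):
    """Count missing entries (absent, None or NaN) per column."""
    counts = {column: 0 for column in columns}
    for row in rows:
        for column in columns:
            value = row.get(column)
            if value is None or (isinstance(value, float) and math.isnan(value)):
                counts[column] += 1
    return counts


# Hypothetical rows with gaps.
rows = [
    {"MW": 398.374, "LogP": 4.30202},
    {"MW": None, "LogP": 2.32092},
    {"MW": float("nan")},
]

counts = missing_counts(rows, ["MW", "LogP"])
print(counts)
```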

Sample

First rows

(index) | Unnamed: 0 | molecule_chembl_id | canonical_smiles | standard_value | 0 | MW | LogP | NumHDonors | NumHAcceptors | standard_value_norm | bioactivity_class
0 | 0 | CHEMBL324340 | Cc1ccc2oc(-c3cccc(N4C(=O)c5ccc(C(=O)O)cc5C4=O)c3)nc2c1 | 2500.000 | intermediate | 398.374 | 4.30202 | 1.0 | 5.0 | 2500.000 | intermediate
1 | 1 | CHEMBL324340 | Cc1ccc2oc(-c3cccc(N4C(=O)c5ccc(C(=O)O)cc5C4=O)c3)nc2c1 | 50000.000 | inactive | 398.374 | 4.30202 | 1.0 | 5.0 | 50000.000 | inactive
2 | 2 | CHEMBL109600 | COc1ccccc1-c1ccc2oc(-c3ccc(OC)c(N4C(=O)c5ccc(C(=O)O)cc5C4=O)c3)nc2c1 | 9000.000 | intermediate | 520.497 | 5.67780 | 1.0 | 7.0 | 9000.000 | intermediate
3 | 3 | CHEMBL357278 | Cc1nc2cc(OC[C@H](O)CN3CCN(CC(=O)Nc4ccc(Cl)c(C(F)(F)F)c4)CC3)ccc2s1 | 4000.000 | intermediate | 543.011 | 4.27292 | 2.0 | 7.0 | 4000.000 | intermediate
4 | 4 | CHEMBL357119 | Cc1nc2cc(OC[C@H](O)CN3CCN(CC(=O)NCCc4ccccc4)CC3)ccc2s1 | 17000.000 | inactive | 468.623 | 2.32092 | 2.0 | 7.0 | 17000.000 | inactive
5 | 5 | CHEMBL152968 | Cc1nc2cc(OC[C@H](O)CN3CCN(CC(=O)Nc4cccc(-c5ccccc5)c4)CC3)ccc2s1 | 180.000 | active | 516.667 | 4.26772 | 2.0 | 7.0 | 180.000 | active
6 | 6 | CHEMBL152968 | Cc1nc2cc(OC[C@H](O)CN3CCN(CC(=O)Nc4cccc(-c5ccccc5)c4)CC3)ccc2s1 | 6000.000 | intermediate | 516.667 | 4.26772 | 2.0 | 7.0 | 6000.000 | intermediate
7 | 7 | CHEMBL152968 | Cc1nc2cc(OC[C@H](O)CN3CCN(CC(=O)Nc4cccc(-c5ccccc5)c4)CC3)ccc2s1 | 37000.000 | inactive | 516.667 | 4.26772 | 2.0 | 7.0 | 37000.000 | inactive
8 | 8 | CHEMBL152968 | Cc1nc2cc(OC[C@H](O)CN3CCN(CC(=O)Nc4cccc(-c5ccccc5)c4)CC3)ccc2s1 | 24000.000 | inactive | 516.667 | 4.26772 | 2.0 | 7.0 | 24000.000 | inactive
9 | 9 | CHEMBL429345 | Nc1nc(O)c2c(n1)NCC(CCOc1ccc(C(=O)NC(CCC(=O)O)C(=O)O)cc1)N2 | 0.009 | active | 460.447 | 0.48730 | 7.0 | 10.0 | 0.009 | active

Last rows

(index) | Unnamed: 0 | molecule_chembl_id | canonical_smiles | standard_value | 0 | MW | LogP | NumHDonors | NumHAcceptors | standard_value_norm | bioactivity_class
962 | 962 | CHEMBL1204215 | Cc1cc2nc3ccc(C(=O)NCCCCc4cccnc4)cn3c(=O)c2cc1C.Cl.Cl | 100.0 | active | 473.404 | 4.45584 | 1.0 | 5.0 | 100.0 | active
963 | 963 | CHEMBL542070 | CCN(CC)CCNc1ccc2c3c(nn2CCO)-c2ccccc2C(=O)c13.Cl | 880.0 | active | 414.937 | 3.41550 | 2.0 | 6.0 | 880.0 | active
964 | 964 | CHEMBL542566 | CN(C)CCNc1ccc2c3c(nn2CCNCCO)-c2c(O)ccc(O)c2C(=O)c13.Cl | 230.0 | active | 461.950 | 1.63610 | 5.0 | 9.0 | 230.0 | active
965 | 965 | CHEMBL23932 | CCN1CCC[C@H]1CNC(=O)c1c(OC)ccc(Cl)c1OC | 3190.0 | intermediate | 326.824 | 2.57130 | 1.0 | 4.0 | 3190.0 | intermediate
966 | 966 | CHEMBL316884 | O=C(CN(CCCCc1ccccc1)C(=O)CN(CCCCc1ccccc1)C(=O)Nc1ccc(Oc2ccccc2)cc1)NO | 52000.0 | inactive | 622.766 | 6.69260 | 3.0 | 5.0 | 52000.0 | inactive
967 | 967 | CHEMBL87187 | Nc1cccc(-c2ccc(CCN3CCN(c4cccc5cccnc45)CC3)cc2)n1 | 130.0 | active | 409.537 | 4.24370 | 1.0 | 5.0 | 130.0 | active
968 | 968 | CHEMBL87187 | Nc1cccc(-c2ccc(CCN3CCN(c4cccc5cccnc45)CC3)cc2)n1 | 212.0 | active | 409.537 | 4.24370 | 1.0 | 5.0 | 212.0 | active
969 | 969 | CHEMBL87187 | Nc1cccc(-c2ccc(CCN3CCN(c4cccc5cccnc45)CC3)cc2)n1 | 825.0 | active | 409.537 | 4.24370 | 1.0 | 5.0 | 825.0 | active
970 | 970 | CHEMBL62565 | c1cnc(N2CCN(Cc3cccc4c3sc3ccccc34)CC2)nc1 | 100.0 | active | 360.486 | 4.16670 | 0.0 | 5.0 | 100.0 | active
971 | 971 | CHEMBL120853 | C[C@@H]1C(=O)N(C(=O)NCc2ccccc2)[C@@H]1Oc1ccc(C(=O)O)cc1 | 100000.0 | inactive | 354.362 | 2.47780 | 2.0 | 4.0 | 100000.0 | inactive